Discretization of Continuous Attributes

Authors

  • Fabrice Muhlenbach
  • Ricco Rakotomalala
Abstract

In the data-mining field, many learning methods — such as association rules, Bayesian networks, and induction rules (Grzymala-Busse & Stefanowski, 2001) — can handle only discrete attributes. Before the machine-learning process, it is therefore necessary to re-encode each continuous attribute as a discrete attribute made up of a set of intervals. For example, the age attribute can be transformed into two discrete values representing two intervals: less than 18 (a minor) and 18 or greater. This process, known as discretization, is an essential data-preprocessing task, not only because some learning methods cannot handle continuous attributes, but also for other important reasons. Data transformed into a set of intervals are more cognitively relevant for human interpretation (Liu, Hussain, Tan, & Dash, 2002); computation goes faster on the reduced data, particularly when some attributes can be removed from the representation space of the learning problem because no relevant cut can be found for them (Mittal & Cheong, 2002); and discretization can capture nonlinear relations — for example, infants and elderly people are both more sensitive to illness, so the relation between age and illness is not linear — which is why many authors propose discretizing the data even when the learning method can handle continuous attributes (Frank & Witten, 1999). Lastly, discretization can harmonize the nature of the data when it is heterogeneous — for example, in text categorization, the attributes are a mix of numerical values and term occurrences (Macskassy, Hirsh, Banerjee, & Dayanik, 2001). An expert produces the best discretization, because he or she can adapt the interval cuts to the context of the study and can thus make sense of the transformed attributes. As mentioned previously, the continuous attribute "age" can be divided into two categories.
Take basketball as an example; what is interesting about this sport is that it has many age categories: “mini-mite” (under 7), “mite” (7 to 8), “squirt” (9 to 10), “peewee” (11 to 12), “bantam” (13 to 14), “midget” (15 to 16), “junior” (17 to 20), and “senior” (over 20). Nevertheless, this approach is not feasible in most machine-learning problems, because no expert is available, there is no a priori knowledge of the domain, or, for a large dataset, the human cost would be prohibitive. An automated method is therefore needed to discretize the predictive attributes and find the cut-points best adapted to the learning problem. Discretization was little studied in statistics, except in some rather old articles treating it as a special case of one-dimensional clustering (Fisher, 1958), but from the beginning of the 1990s research expanded very quickly with the development of supervised methods (Dougherty, Kohavi, & Sahami, 1995; Liu et al., 2002). Lately, discretization has spread to other fields: an efficient discretization can also improve the performance of discrete methods such as association-rule construction (Ludl & Widmer, 2000a) or the machine learning of a Bayesian network (Friedman & Goldszmidt, 1996). In this article, we present discretization as a preliminary step of the learning process. The presentation is limited to global discretization methods (Frank & Witten, 1999), because in a local discretization the cutting process depends on the particularities of the model construction — for example, discretization in rule induction associated with genetic algorithms (Divina, Keijzer, & Marchiori, 2003) or lazy discretization associated with naïve Bayes classifier induction (Yang & Webb, 2002).
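The expert-defined categories above amount to a lookup over an ordered list of cut-points. A minimal sketch in Python, using the basketball cut-points quoted above (the function name and the exact boundary convention are illustrative assumptions, not from the article):

```python
from bisect import bisect_right

# Upper bounds of each interval (an age below CUTS[i] and at or above
# CUTS[i-1] falls in interval i); labels follow the basketball example.
CUTS = [7, 9, 11, 13, 15, 17, 21]
LABELS = ["mini-mite", "mite", "squirt", "peewee",
          "bantam", "midget", "junior", "senior"]

def discretize_age(age):
    """Map a continuous age to its discrete interval label."""
    return LABELS[bisect_right(CUTS, age)]

print([discretize_age(a) for a in (6, 8, 12, 19, 30)])
# → ['mini-mite', 'mite', 'peewee', 'junior', 'senior']
```

The same lookup works for any discretization once the cut-points are fixed; the whole difficulty, as the text notes, lies in choosing those cut-points automatically.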
Moreover, even though this article presents the different approaches to discretizing continuous attributes whatever learning method is used, within the supervised learning framework only the discretization of the predictive attributes is presented. How to cut the attribute to be predicted depends strongly on the particular properties of the problem at hand. Discretizing the class attribute is not realistic, because such a pretreatment, if carried out, would be the learning process itself.
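As a rough illustration of what a supervised discretization method automates, the sketch below searches for the single cut-point on a predictive attribute that minimizes the weighted class entropy of the two resulting intervals, in the spirit of the entropy-based methods surveyed by Dougherty et al. (1995). It is a simplified toy under assumed data, not the authors' algorithm, and it finds only one cut rather than a full set of intervals:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return the cut-point minimizing the size-weighted class entropy
    of the two intervals it induces; None if no cut is possible."""
    pairs = sorted(zip(values, labels))
    best, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical attribute values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_score = score
            best = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint cut
    return best

# Toy data echoing the minor/adult example: the best cut lands at 18.
ages = [12, 15, 16, 17, 19, 25, 40]
status = ["minor"] * 4 + ["adult"] * 3
print(best_cut(ages, status))  # → 18.0
```

Real supervised methods extend this greedy search recursively and add a stopping criterion (e.g. the MDL-based rule of entropy discretization) to decide how many intervals to keep.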


Similar resources

Dynamic Discretization of Continuous Attributes

Discretization of continuous attributes is an important task for certain types of machine learning algorithms. Bayesian approaches, for instance, require assumptions about data distributions. Decision trees, on the other hand, require sorting operations to deal with continuous attributes, which greatly increase learning times. This paper presents a new method of discretization, whose main char...


Global discretization of continuous attributes as preprocessing for machine learning

Real-life data are usually represented in databases by real numbers. On the other hand, most inductive learning methods require a small number of attribute values. Thus it is necessary to convert input data sets with continuous attributes into input data sets with discrete attributes. Methods of discretization restricted to single continuous attributes will be called local, while methods that sim...


Discretization of Continuous-valued Attributes and Instance-based Learning

Recent work on discretization of continuous-valued attributes in learning decision trees has produced some positive results. This paper adopts the idea of discretization of continuous-valued attributes and applies it to instance-based learning (Aha, 1990; Aha, Kibler & Albert, 1991). Our experiments have shown that instance-based learning (IBL) usually performs well in continuous-valued attribu...


Compression-Based Discretization of Continuous Attributes

Discretization of continuous attributes into ordered discrete attributes can be beneficial even for propositional induction algorithms that are capable of handling continuous attributes directly. Benefits include possibly large improvements in induction time, smaller sizes of induced trees or rule sets, and even improved predictive accuracy. We define a global evaluation measure for discretization...


Hierarchical Discretization of Continuous Attributes Using Dynamic Programming

The area of knowledge discovery and data mining is growing rapidly. A large number of methods are employed to mine knowledge, and several of these methods rely on discrete data. However, most datasets used in real applications have attributes with continuous values. To make data mining techniques useful for such datasets, discretization is performed as a pre-processing step. Discretization is a pr...


An Evolution Strategies Approach to the Simultaneous Discretization of Numeric Attributes

Many data mining and machine learning algorithms require databases in which objects are described by discrete attributes. However, it is very common that the attributes are in the ratio or interval scales. In order to apply these algorithms, the original attributes must be transformed into the nominal or ordinal scale via discretization. An appropriate transformation is crucial because of the l...



Journal title:

Volume   Issue 

Pages  -

Publication date: 2015